Predicting compound-protein interaction using hierarchical graph convolutional networks

Motivation Convolutional neural networks have enabled unprecedented breakthroughs in a variety of computer vision tasks. They have also drawn much attention from other domains, including drug discovery and drug development. In this study, we develop a computational method based on convolutional neural networks to tackle a fundamental question in drug discovery and development, i.e. the prediction of compound-protein interactions based on compound structure and protein sequence. We propose a hierarchical graph convolutional network (HGCN) to encode small molecules. The HGCN aggregates a molecule embedding from substructure embeddings, which are in turn synthesized from atom embeddings. As small molecules usually share substructures, computing a molecule embedding from those common substructures allows us to learn better generic models. We then combine the HGCN with a one-dimensional convolutional network to construct a complete model for predicting compound-protein interactions. Furthermore, we apply an explanation technique, Grad-CAM, to visualize the contribution of each amino acid to the prediction. Results Experiments using different datasets show that our model improves over other GCN-based methods and a sequence-based method, DeepDTA, in predicting compound-protein interactions. Each prediction made by the model is also explainable and can be used to identify critical residues mediating the interaction.


Introduction
The identification of compound-protein interactions (CPI) plays an important role in drug discovery and development. Typical applications include high-throughput screening of compound libraries against given protein targets to identify desirable compound-protein interactions, or testing given compounds against possible off-target proteins to avoid undesired effects [1,2]. Experimental identification of every possible compound-protein pair is impractical, if not impossible, due to the enormity of the chemical and proteomic space. Therefore, computational methods to predict compound-protein interactions have received increasing attention. In particular, the adaptation of deep learning models to structured data has opened a new paradigm for pharmaceutical research. Given a compound-protein pair, CPI prediction methods aim to predict a binary value indicating whether the compound and the protein interact [3-6], a numeric value indicating their binding affinity [1,7-12], or the binding sites for a specific compound within the protein [13-16]. Existing CPI prediction methods are diverse in terms of feature engineering and machine learning models, and can be categorized into several classes. The first class consists of similarity-based models, which utilize drug-drug and target-target similarity matrices to infer possible interactions. This approach commonly applies nearest-neighbor and kernel-based classifiers as machine learning models; a non-exhaustive list of such methods includes [17-20]. While similarity measures vary greatly among methods, drug-drug similarity is generally based on chemical structure fingerprints, whereas target-target similarity typically depends on a sequence alignment score [8]. In the second class, several studies [21,22] have analyzed binding affinity matrices for interaction prediction. Matrix factorization techniques have been used in this case to reveal latent features of drugs and proteins. A study from 2017 [8] presented SimBoost, combining both similarity matrices and binding affinity matrices to construct features for drugs, targets and drug-target pairs. These features were used as the input of a gradient boosting regression tree, creating a model to predict binding affinity. The third class consists of learning models that use pre-defined features and fingerprints to classify binders from non-binders. For example, Cheng et al. [23] trained support vector machine (SVM) models, using Molecular ACCess System (MACCS) keys and 1400 PROFEAT protein descriptors as input features. Likewise, Yu et al. [24] used PROFEAT protein descriptors in combination with chemical compound descriptors to train Random Forest and SVM models. Lastly, Wen et al. [25] built a Restricted Boltzmann Machine with Extended Connectivity Fingerprints (ECFP) and protein sequence composition features.
As a result of important theoretical and practical developments in recent years, several deep learning based methods have been devised to improve CPI identification. Deep learning models, with their complex architectures, are able to learn abstract features from raw data, which can lead to improved performance. Convolutional neural networks (CNN) have shown impressive performance in a variety of tasks, such as image recognition, text classification and audio processing [26]. Building on this success, several studies have investigated 3D-CNN models that predict CPI from 3D structural data [27,28]. Typically, the data is constructed by discretizing the protein-compound molecular structure into a 3D grid centered on the binding site. However, structural information regarding the molecular interaction between the protein and compound is not always available, limiting the scope of these methods. Recently, Zhao et al. [6] and Zong et al. [29] built association networks among drugs and targets, allowing them to learn features for drugs, targets or drug-target pairs using graph embedding learning algorithms such as graph convolutional networks (GCN) [30] or node2vec [31]. The drawback of these network-based methods is that they require retraining when new nodes are inserted and cannot predict associations for unseen nodes.
Some of the most readily available data representations in CPI datasets are Simplified Molecular Input Line Entry System (SMILES) strings for compounds and amino acid sequences for proteins. Each SMILES string corresponds to a unique molecular graph that describes the structure of the compound: the nodes of the graph represent atoms, while the edges represent the covalent bonds between them. Given the availability of large CPI datasets, several studies have investigated deep learning models that work directly with these data formats. In general, these models consist of three key components, each built from a set of distinct neural network layers. Two components are responsible for encoding compounds and proteins respectively, while the final component translates the output of the encoding layers into a CPI prediction. The predicting component is usually a fully-connected network, computing whether an interaction occurs or computing a binding affinity value. The protein is commonly encoded through sequence-based models. DeepDTA [11], DeepConv-DTI [9], GraphDTA [10], Tsubaki et al. [5], MT-DTI [12] and TransformerCPI [3] apply 1D-CNN layers to encode protein sequences, while DeepAffinity [1] and DeepCDA [7] combine 1D-CNN layers with recurrent neural network (RNN) or long short-term memory (LSTM) layers, respectively. The compound is encoded with sequence-based or graph-based models, depending on the input information. DeepDTA, DeepAffinity, DeepCDA, and MT-DTI developed sequence-based models that use SMILES strings directly, considering compounds as sequences over a finite alphabet. In contrast, several studies have explored the molecular structure of compounds as input features. DeepConv-DTI [9] uses ECFP fingerprints combined with a fully-connected network, while the studies [4,5,10] investigated GCNs to learn structural features of compounds. Different GCN models have been implemented, including the graph Laplacian based GCN [32], Graph Attention Networks (GAT) [33] and Graph Isomorphism Networks (GIN) [34]. In the TransformerCPI method, a GCN was combined with a Transformer decoder, a sequence-based model, to generate an embedding for every atom of the compound. In 2019, Karimi et al. [1] compared the performance of the two approaches to compound representation: they implemented an RNN working with SMILES strings and the GCN proposed in [4]. Their results showed that the GCN does not outperform the RNN. We obtained the same result when we compared the performance of a recent sequence-based model with some existing GCNs on the human, C. elegans [35] and bindingdb [4] data. However, graph structures are more descriptive than SMILES strings, especially for model interpretation [1]. Therefore, developing a better graph-based model to represent compounds is still an open problem.
Deep learning methods for CPI prediction have been shown to outperform non-deep learning methods on benchmark datasets [3,5,11]. In this study, we explore a novel deep learning approach to predict whether compounds and proteins interact, based on compound graphs and amino acid sequences, focusing on building an efficient GCN to represent compound structures. Similar to existing deep learning methods, our model includes three components: a GCN to encode compounds, a 1D-CNN based model to encode proteins, and a fully-connected network to predict CPI. We postulate that looking at the local parts of data objects independently can improve the performance of trained models, especially for small molecules, which usually share many substructures. From this rationale, we introduce a procedure to construct a hierarchical representation of compounds, in which compounds are considered as graphs of substructures, while the substructures are considered as graphs of atoms. Based on this representation, we propose a hierarchical GCN to model compounds. The hierarchical GCN learns embeddings for atoms, then substructures, and finally the entire graph. For the 1D-CNN based model encoding proteins, we combine 1D-CNN layers with pooling layers as well as fully-connected layers. Validation experiments on four different CPI datasets show that the proposed method provides improved performance compared to existing GCN-based and string-based models. In addition, we applied Grad-CAM [36] to our model to evaluate the contribution of specific regions within protein sequences to the prediction.

Data pre-processing
The input of our model consists of proteins, represented as amino acid sequences, and compounds, represented as graphs. Amino acid sequences are projected into sequences of integers in which each integer represents a unique amino acid. Compound graphs are generated from SMILES strings using RDKit [37]. Compound graphs are then broken into subgraphs according to aromatic and non-aromatic bonds. Our compound breaking procedure can be summarized as follows. First, the compound is split into two parts, one consisting of aromatic bonds and the other of non-aromatic bonds. This separation creates new graphs that can have more than one connected component, where each connected component corresponds to one specific substructure. Then, a connected component finding algorithm is used to detect the substructures. Finally, virtual edges are added between substructures that share at least one atom in the original graph. This procedure generates a hierarchical representation of compounds. Fig 1 is an example of constructing the hierarchical representation for a compound with formula C25H25ClN6O. The resulting representation is a reduced graph including eight nodes and seven virtual edges: two nodes are benzenes, one is a quinazoline and the other nodes are non-aromatic substructures. These nodes are themselves graphs, of which the nodes are atoms and the edges are connections between atoms. For compounds consisting of only aromatic or only non-aromatic bonds, the hierarchical representation contains a single node corresponding to the entire original graph.
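To make the procedure concrete, the sketch below implements the splitting step, assuming RDKit and NetworkX are available. The function name and the reduced-graph representation (a list of atom-index sets plus a list of virtual edges) are illustrative choices, not our exact implementation.

```python
from rdkit import Chem
import networkx as nx

def split_into_substructures(smiles):
    """Split a molecule into aromatic and non-aromatic substructures and
    connect substructures sharing an atom with a virtual edge."""
    mol = Chem.MolFromSmiles(smiles)
    aromatic, plain = nx.Graph(), nx.Graph()
    for g in (aromatic, plain):
        g.add_nodes_from(atom.GetIdx() for atom in mol.GetAtoms())
    # Partition the bonds of the original graph by aromaticity
    for bond in mol.GetBonds():
        target = aromatic if bond.GetIsAromatic() else plain
        target.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())
    # Each connected component containing at least one bond is a substructure
    substructures = [comp for g in (aromatic, plain)
                     for comp in nx.connected_components(g) if len(comp) > 1]
    # Virtual edges link substructures that share at least one atom
    virtual_edges = [(i, j)
                     for i in range(len(substructures))
                     for j in range(i + 1, len(substructures))
                     if substructures[i] & substructures[j]]
    return substructures, virtual_edges

# Biphenyl splits into two aromatic rings plus the non-aromatic bridge bond,
# giving three substructures and two virtual edges
subs, edges = split_into_substructures("c1ccccc1-c1ccccc1")
print(len(subs), edges)
```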

Compound representation
In this section, we present in detail the hierarchical graph convolutional network (HGCN) used to encode compounds. In general, a graph convolutional network (GCN) is a neural network architecture defined according to a graph structure $G = (V, E)$. Nodes $v \in V$ take unique values from $1, \ldots, |V|$, and edges in $E$ are pairs $(u, v) \in V \times V$. Graphs may contain node labels $l_v \in \{1, \ldots, L_V\}$ for each node and edge labels $l_{(u,v)} \in \{1, \ldots, L_E\}$ for each edge. Given initial node features $x_v$ and edge features $e_{uv}$, the aim of the GCN is to learn a state embedding $h_v$ for every node $v$ in $G$. The node embeddings can then be used to synthesize the graph embedding of $G$. The forward propagation of GCNs includes two phases: message passing and readout [38]. The message passing phase computes node embeddings through an iterative procedure defined in terms of message functions $M_t$ and node update functions $U_t$, as follows:

$$m_v^{t+1} = \mathrm{Aggr}_{u \in N(v)} \, M_t(h_v^t, h_u^t, e_{uv}), \qquad h_v^{t+1} = U_t(h_v^t, m_v^{t+1})$$

Here, Aggr denotes the aggregation function, which can be a sum, mean or max function, and $N(v)$ are the neighbors of node $v$ in graph $G$. As the functions $M_t$ and $U_t$ are typically shared over all locations in the graph, the neural network architecture is referred to as a convolutional neural network. In the readout phase, a function $R$ is applied to compute the state embedding for the entire graph based on the node embeddings at the last iteration $T$:

$$h_G = R(\{h_v^T \mid v \in G\})$$
Existing GCN models vary in how the functions $M_t$, $U_t$ and $R$ are defined.
Within the proposed HGCN, we compute the graph-level embedding from the subgraph-level embeddings, which in turn are aggregated from the node-level embeddings. Messages are first passed among atoms within the scope of each subgraph, using message functions $M_t^{(a)}$ and update functions $U_t^{(a)}$; communication among subgraphs is not considered in this step. After $T^{(a)}$ iterations, the initial state embedding for every subgraph is assigned using a readout function $R^{(sg)}$. Another message passing phase then runs at the subgraph level, in which messages are passed through the virtual edges with message functions $M_t^{(sg)}$ and update functions $U_t^{(sg)}$. Finally, a readout function $R^{(G)}$ aggregates the state embedding for the entire graph $G$. The node embeddings are computed as follows:

$$h_v^0 = x_v, \qquad h_v^{t+1} = \sigma\!\left(W_t^{(a)}\left[h_v^t, \mathrm{Aggr}_{u \in N(v)}\, h_u^t\right] + b_t^{(a)}\right)$$

Here $[a, b]$ denotes the concatenation of two vectors $a$ and $b$ in their last dimension, $\sigma$ denotes a nonlinear activation function, and $x_v$ are one-hot vectors corresponding to atom labels. The one-hot vectors can be replaced by predefined atom features. The embedding of every node $v$ is concatenated with the signals from its neighbors, and the resulting vector is passed through a fully-connected (FC) layer to obtain the updated embedding of $v$. We choose concatenation over directly adding the neighbors' signals to the embedding of $v$ because the two carry different meanings, and these meanings are learnable through the weights of the FC layer. At the subgraph level, we use the same message passing scheme as at the node level, but the initial state embeddings for the subgraphs $c_i$ are aggregated from their atoms:

$$h_{c_i}^0 = R^{(sg)}\!\left(\{h_v^{T^{(a)}} \mid v \in c_i\}\right), \qquad h_{c_i}^{t+1} = \sigma\!\left(W_t^{(sg)}\left[h_{c_i}^t, \mathrm{Aggr}_{c_j \in N(c_i)}\, h_{c_j}^t\right] + b_t^{(sg)}\right)$$
where $N(c_i)$ are the neighboring subgraphs of $c_i$ according to the virtual edges, and $W_t^{(a)}$, $W_t^{(sg)}$, $b_t^{(a)}$ and $b_t^{(sg)}$ are learned parameters. After $T^{(sg)}$ time steps, we compute the graph embedding of $G$ according to:

$$h_G = R^{(G)}\!\left(\{h_{c_i}^{T^{(sg)}} \mid c_i \in G\}\right)$$

Within this study, we run the message passing phase at the atom level with $T^{(a)} = 2$ time steps and at the subgraph level with $T^{(sg)} = 1$ time step. Each time step is associated with a neural network layer. Batch normalization layers are used after each time step at the atom level. The convolution layer at the subgraph level takes into account the interactions between these components to update their embeddings and finally generates the embedding for the entire graph.
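As an illustration, the following PyTorch sketch implements one message passing step and the two-level encoding with dense adjacency matrices. The sum aggregation, ReLU activation and sum readouts are simplifying assumptions for readability; our actual implementation uses PyTorch Geometric.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One time step: concatenate each node's embedding with the aggregated
    embeddings of its neighbors, then apply a fully-connected layer."""
    def __init__(self, dim_in, dim_out, batch_norm=True):
        super().__init__()
        self.fc = nn.Linear(2 * dim_in, dim_out)
        self.bn = nn.BatchNorm1d(dim_out) if batch_norm else nn.Identity()

    def forward(self, h, adj):
        # h: (n_nodes, dim_in); adj: (n_nodes, n_nodes) binary adjacency
        neighbor_signal = adj @ h  # Aggr over N(v), here a sum
        h = self.fc(torch.cat([h, neighbor_signal], dim=-1))
        return torch.relu(self.bn(h))

def encode_compound(x_atoms, adj_atoms, atom_to_sub, adj_subs,
                    atom_layers, sub_layer):
    """Two atom-level steps, subgraph readout, one subgraph-level step,
    then a graph readout, mirroring T^(a) = 2 and T^(sg) = 1."""
    h = x_atoms
    for layer in atom_layers:           # list of two MessagePassingLayer
        h = layer(h, adj_atoms)         # adj_atoms: intra-substructure bonds only
    h_sub = atom_to_sub @ h             # R^(sg): sum atom embeddings per substructure
    h_sub = sub_layer(h_sub, adj_subs)  # message passing over the virtual edges
    return h_sub.sum(dim=0)             # R^(G): compound embedding
```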

Protein representation
Fig 3 shows the model we built to encode protein sequences. We used a 1D-CNN based model, including three 1D-CNN layers, two of which are followed by max pooling layers, and two fully-connected (FC) layers at the end. We fix the kernel sizes of the first 1D-CNN layer and the last max pooling layer to 3 and 12, respectively. For the other 1D-CNN and max pooling layers, this hyper-parameter can be customized by users. First, protein sequences are transformed into sequences of numerical vectors by an embedding layer, where every vector corresponds to an amino acid. These sequences then pass through the 1D-CNN and max pooling layers, which downsample them into shorter sequences. After the last max pooling layer, the resulting sequences are flattened into large vectors. The two FC layers at the end compress these vectors into smaller vectors that form the final representation of the protein.
As protein sequences have varying lengths, we chose to fix the maximum length of input sequences to the length of the longest protein in the training data. Sequences longer than this maximum are truncated, whereas shorter sequences are zero-padded. Additionally, we added two special symbols to the input sequences to mark their beginning and end positions. As a result, the vocabulary for the embedding layer consists of 24 words: 22 corresponding to amino acids and 2 corresponding to the begin/end symbols. The 22 amino acids include the 20 of the standard genetic code and 2 that can be incorporated by special translation mechanisms, selenocysteine (U) and pyrrolysine (O).
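A condensed PyTorch sketch of this encoder is given below. The channel widths, the FC hidden size and the padding handling are illustrative assumptions; the customizable kernel size appears as the parameter k.

```python
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    """Embedding, three 1D-CNN layers (the first with kernel size 3), two
    max pooling layers (the last with kernel size 12), and two FC layers."""
    def __init__(self, max_len, vocab_size=24, emb_dim=64, out_dim=128, k=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, emb_dim, padding_idx=0)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 32, kernel_size=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=k), nn.ReLU(),
            nn.MaxPool1d(k),
            nn.Conv1d(64, 64, kernel_size=k), nn.ReLU(),
            nn.MaxPool1d(12),
        )
        with torch.no_grad():  # infer the flattened size with a dummy pass
            n_flat = self.conv(torch.zeros(1, emb_dim, max_len)).numel()
        self.fc = nn.Sequential(nn.Linear(n_flat, 256), nn.ReLU(),
                                nn.Dropout(0.2), nn.Linear(256, out_dim))

    def forward(self, seq):                  # seq: (batch, max_len) integer codes
        x = self.embed(seq).transpose(1, 2)  # to (batch, emb_dim, max_len)
        return self.fc(self.conv(x).flatten(1))
```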

CPI prediction model
To predict a CPI, the compound embedding and the protein embedding are concatenated into a single vector. These vectors represent compound-protein pairs and are used as input features for a four-layer fully-connected neural network (FCNN). The output of this network is a vector $z \in \mathbb{R}^2$. A softmax layer is applied on top of the output vectors $z$ to generate the CPI probability according to:

$$P(y = k \mid x) = \frac{e^{z_k}}{\sum_{k' \in \{0,1\}} e^{z_{k'}}}$$

Here, $k \in \{0, 1\}$ denotes the class label, $x$ is the set of compound-protein pair features and $z_k$ is the value at the $k$-th dimension of vector $z$. $P(y = k \mid x)$ is the probability of the $k$-th class (i.e. interaction or no interaction) given a CPI $x$. The compound and protein embeddings are generated by the HGCN and the 1D-CNN based model described in the previous sections. To reduce over-fitting, we applied dropout with rate $p$ to the last two layers of the 1D-CNN model. Fig 4 presents an overview of our CPI prediction model. In this study, all models are trained with learning rate $\alpha = 0.0001$ and dropout $p = 0.2$, and Adam [39] is used as the optimization algorithm. The dimensions of the initial and final embeddings for both compounds and proteins are 64 and 128, respectively. We implemented our model using the PyTorch and PyTorch Geometric libraries.
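The prediction head can be sketched as follows; the hidden width of 256 is an assumption for illustration, while the four linear layers, the two-dimensional output z and the softmax follow the description above.

```python
import torch
import torch.nn as nn

class CPIPredictor(nn.Module):
    """Four-layer FCNN mapping a concatenated compound-protein embedding
    to z in R^2, followed by a softmax over the two classes."""
    def __init__(self, emb_dim=128, hidden=256):
        super().__init__()
        self.fcnn = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, h_compound, h_protein):
        z = self.fcnn(torch.cat([h_compound, h_protein], dim=-1))
        return torch.softmax(z, dim=-1)  # P(y = k | x) for k in {0, 1}
```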

Experiment and result
To evaluate our model and compare its performance to other deep learning based methods, we use four independent datasets. The compared methods include GCN [5], GraphDTA [10], TransformerCPI [3] and DeepDTA [11]. We also implemented another model with the same architecture as ours but without the substructure splitting information; for simplicity, we denote this method as HGCN-nosplit, while the proposed method is denoted as HGCN. Among these methods, DeepDTA directly uses the SMILES string as input and constructs a sequence-based model to encode compounds, while the others employ different GCNs. DeepDTA and GraphDTA by default predict the binding affinity value between compounds and proteins; we modified their output layers to generate a binary value instead of a continuous value, which allows us to compare the models directly. All models are trained using early stopping based on validation data, with a patience of n = 10 successive epochs. The datasets consist of human and C. elegans from [35], bindingdb created by [4], and chembl27, downloaded from the ChEMBL database [40]. Except for the chembl27 dataset, all datasets have been used to benchmark CPI prediction methods in previous studies [3-5,7]. For the human and C. elegans datasets, Liu et al. [35] retrieved positive samples from two manually curated databases, DrugBank 4.1 [41] and Matador [42], while they obtained negative samples using an in silico screening method. Gao et al. [4] created the bindingdb dataset from BindingDB [43], a database of measured binding affinities, focusing primarily on the interactions of small molecules and proteins. The samples are labeled as positive if their IC50 is less than 100nM and negative if their IC50 is larger than 10,000nM. For the chembl27 dataset, we collected samples based on the binding activities of small molecules from the ChEMBL database [44], only retaining the entries with the standard measure IC50. The thresholds to assign samples as positives or negatives are ≤ 100nM and ≥ 1000nM, respectively. Table 1 shows the number of compounds, proteins, positive samples and negative samples for each dataset. With approximately 8000 samples, the human and C. elegans datasets are substantially smaller than the other datasets, but they are more balanced in their numbers of compounds and proteins. Table 2 additionally offers an overview of the training, testing and validation dataset sizes. We evaluate the models on the small datasets using k-fold cross validation, with k = 5, while the larger datasets are split into training, validation and testing subsets. To check the similarity between the training and testing data, we computed pairwise local alignments for proteins and Morgan fingerprint-based similarity with a radius of 2 for compounds. The average similarities between the training and testing data are less than 0.33 for proteins and 0.28 for compounds. Table 3 provides insight into the testing data of all datasets and enumerates the number of samples in each scenario: new compound-known protein, known compound-new protein, new compound-new protein and known compound-known protein. A "known" compound/protein has been seen in the training data, whereas a "new" one has not. Note that a known compound-known protein pair signifies an unseen pair built up from a compound and a protein that are both present in the training data.
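For the compound side of this similarity check, a minimal sketch using RDKit is shown below; the 2048-bit fingerprint length and the Tanimoto coefficient are assumed choices, and the protein local alignment part is omitted.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def compound_similarity(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto similarity between radius-2 Morgan fingerprints."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Average train/test compound similarity over all pairs (or a sample):
# sims = [compound_similarity(a, b) for a in train_smiles for b in test_smiles]
```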

Comparison in performance
For simplicity, all the customizable layers in the protein encoder were assigned the same kernel size. Validation data is used to tune the hyper-parameters, mainly the kernel size of the customizable layers. The tuning process showed that k = 6 is sufficient for most experimental datasets; as a result, we chose k = 6 as the kernel size of our model in this experiment. We also trained our model with the same learning rate and weight decay, 1e-4, for all datasets. Table 4 shows the MSE, F1 and AUC scores of the prediction models on the experimental data. In the case of the C. elegans and human data, we computed a k-fold cross-validated paired t-test to compare our method with the others; a '*' is written next to a score when the corresponding method is significantly different from ours (p-value < 0.05). The improvement of HGCN over HGCN-nosplit shows the benefit of splitting compounds into substructures for CPI prediction: synthesizing the substructure embeddings independently before integrating them into the compound embedding gives a better prediction model. Table 4 also shows that the performance of all methods is high on both the C. elegans and human data, with F1 scores above 0.90 for both datasets. Meanwhile, performance on bindingdb and chembl27 is much lower for all models; for example, the GCN method achieves approximately 0.700 in terms of F1 score on these datasets. These results can be explained by the degree of overlap in proteins and/or compounds between the training and testing data, shown in Table 3. On the C. elegans data, our method is significantly better only than the GCN method, whereas on the human data it is significantly better than both GCN and TransformerCPI while not significantly different from the remaining methods. On the two large datasets, bindingdb and chembl27, our method outperforms the other methods in predicting CPIs.
To allow a more detailed evaluation, we applied the trained models to the testing data, specifically to the four subsets corresponding to the four scenarios mentioned in Table 3. Table 5 presents the AUC scores of the methods in each scenario for each dataset. In general, DeepDTA outperforms the other methods in predicting CPIs with known compound and known protein; in this scenario, our method outperforms the remaining ones, including GCN, TransformerCPI, GraphDTA and HGCN-nosplit. For new compound-known protein samples, our model achieves a performance similar to DeepDTA and GraphDTA but lower than HGCN-nosplit. If we narrow the comparison down to the large datasets, our method gives better predictions than HGCN-nosplit. Furthermore, it outperforms the others in terms of average ranking in the remaining scenarios, known compound-new protein and new compound-new protein.

Table 3. A summary of CPIs in the testing data of the bindingdb and chembl27 datasets in four scenarios: (a) new compound-known protein, (b) known compound-new protein, (c) new compound-new protein, and (d) known compound-known protein. The last two columns indicate the percentage of (c) and (d) in the testing data.

Effect of the kernel size
As the k-mer is a common concept for protein sequences, with biological meaning, we focused on the kernel size of the 1D-CNN and max pooling layers and examined how it affects our model's performance. We set the kernel size of the customizable layers to 6, 8, 10 and 12 and trained a version of our model for each. Fig 5 shows the AUC scores of these trained predictors on the testing data for two datasets: human and bindingdb. For the human dataset, the AUC score changes slightly as a function of the kernel size, reaching its highest value at kernel size k = 10; the AUC of the model trained with kernel size k = 6 is also promising, being merely 0.1% lower than this maximum. For the bindingdb dataset, the performance generally decreases as the kernel size increases from k = 6 to k = 12, with an AUC difference of 3%. This indicates that the kernel size can affect the performance of the model. The kernel size defines the number of adjacent amino acids in the protein considered in one patch. As the ideal split can differ among proteins, the ideal kernel size also differs among datasets. Based on the results we obtained from the four experimental datasets, a kernel size of 6 is a good default.
To gain insight into how the kernel size impacts our model, we also examined the performance of models trained with different kernel sizes in the four different scenarios. Figs 6 and 7 show the results of this examination on the human and bindingdb data, respectively. In the new compound-known protein and known compound-known protein scenarios, our results indicate only slight differences in the AUC score of our model when it is trained with kernel sizes of 6, 8, 10 and 12. In other words, changing the kernel size does not greatly affect the predictions in these scenarios.
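The effect of the kernel size can be related to the receptive field of the last convolutional layer, i.e. how many input residues contribute to one output signal. The helper below computes this for a hypothetical stack of conv/pool layers given as (kernel size, stride) pairs; the listed stack mirrors the encoder sketch above with k = 6 and is an assumption, not our exact configuration.

```python
def receptive_field(layers):
    """Receptive field (in input positions) of one unit after a stack of
    1D conv/pool layers, each given as a (kernel_size, stride) pair."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by the kernel, scaled by the jump
        jump *= stride             # jump tracks the cumulative stride
    return rf

# conv(3), conv(6), pool(6), conv(6), pool(12) with k = 6 -> 109 residues
print(receptive_field([(3, 1), (6, 1), (6, 6), (6, 1), (12, 12)]))
```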

Explanation with Grad-CAM
Grad-CAM [36] can be used to produce a visual explanation of the decisions made by CNN-based models, without modifying the base models or requiring re-training. Grad-CAM is applicable to any CNN-based architecture, including those for image captioning and visual question answering. Grad-CAM operates by flowing gradient information back to the last convolutional layer, where importance values are assigned to each neuron for a particular decision of interest. These importance values are used to compute a class-discriminative localization map. To obtain the full explanation, we can interpolate the map and project the result onto the original input. For 2D-CNN-based models, the class-discriminative localization map $L^c_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$ of width $u$ and height $v$ for any class $c$ is computed as follows:

• Neuron importance weights $\alpha_k^c$ are computed using the gradient of the score for class $c$, $y^c$, with respect to the feature map activations $A^k$ in the last convolutional layer, i.e. $\frac{\partial y^c}{\partial A^k}$, globally averaged over the width and height dimensions:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \quad (13)$$

where $Z$ is the number of positions in the feature map.
• The neuron importance weight $\alpha_k^c$ captures the "importance" of feature map $k$ for the target class $c$. These weights are then combined with the forward activation maps, followed by a ReLU, to produce

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$

The ReLU is applied to the linear combination of maps in order to focus only on the features that have a positive influence on the class of interest.
For the 1D-CNN-based model, Eq (13) is replaced by:

$$\alpha_k^c = \frac{1}{L} \sum_{i=1}^{L} \frac{\partial y^c}{\partial A_i^k}$$

where $L$ and $A^k$ denote the length and the entire $k$-th feature map, respectively. Fig 8 shows an example of a class-discriminative localization map $L^c_{\text{Grad-CAM}}$ with which Grad-CAM explains a CPI prediction generated by our model. This example represents the interaction between Protein Kinase R-like Endoplasmic Reticulum Kinase (PERK) and the small molecule 3Z1 (PDB ID 4X7J). PERK is one of three sensors of misfolded proteins that are known to mediate the unfolded protein response (UPR) through complementary pathways [45]. The two heat maps correspond to two models trained with kernel sizes k = 6 and k = 12, respectively, and reveal the residues predicted to play the most important role in the interaction. With a smaller kernel size, each signal at the last 1D-CNN layer is associated with a smaller region of the input; we therefore obtain a more detailed explanation, focusing on shorter regions, than with larger kernel sizes. Fig 9 shows the 3D visualization of the same CPI, in which (A) is the structure obtained from the PDB, while (B) and (C) highlight the importance of the residues according to the prediction models trained with kernel sizes k = 6 and k = 12, respectively. The prediction models concentrated on several binding site residues within 5 Å of the ligand when making their decision, although residues far from the ligand were also taken into account. Comparing the two models, the one with kernel size k = 6 focuses more on the residues surrounding the ligand within a radius of 5 Å than the one with kernel size k = 12. Fig 10 is another example of using Grad-CAM to explain a CPI prediction given by our model, trained with a kernel size of 6. The example illustrates the CPIs between the ligand 1S7 and two proteins, AmpC beta-lactamase in Escherichia coli K-12 (PDB ID 4KEN) and Penicillin-binding protein 4 (PDB ID 7KCX). Compared to the ligand interactions retrieved from the PDB database, our model recruited a subset of the residues within 5 Å of the ligand as important features in its prediction. We also observe that residues far from the ligand play an important role in the prediction.
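The 1D formulation above takes only a few lines to implement. The sketch below is a minimal version, assuming the activations of the last 1D-CNN layer have been captured (e.g. with a forward hook) and that class_score is the scalar score $y^c$ for the class of interest; variable names are illustrative.

```python
import torch

def grad_cam_1d(class_score, activations):
    """1D Grad-CAM. activations: (batch, K, L) feature maps of the last
    1D-CNN layer; class_score: scalar score y^c of the class of interest."""
    grads = torch.autograd.grad(class_score, activations, retain_graph=True)[0]
    alpha = grads.mean(dim=-1, keepdim=True)            # average over length L
    cam = torch.relu((alpha * activations).sum(dim=1))  # (batch, L)
    # Normalize to [0, 1] so the map can be projected onto the input sequence
    cam = cam - cam.min(dim=-1, keepdim=True).values
    return cam / (cam.max(dim=-1, keepdim=True).values + 1e-8)
```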

Conclusion
We proposed a new hierarchical graph convolutional network based technique that produces embeddings at three levels: node, subgraph and entire graph. We used this GCN model as a compound encoder, combining it with a 1D-CNN based model to build a complete CPI prediction framework. To generate subgraphs in compounds, we developed a simple procedure based on aromatic bonds and connected components. The experimental results demonstrate that our framework outperforms existing GCN based models on different CPI datasets. Compared to DeepDTA, a highly performing sequence-based CPI prediction method, our model is competitive, showing improved performance when large training data is available. In addition, the visual explanation provided by Grad-CAM shows the relevance of the involved residues for the proposed model. This offers a promising extension towards the prediction of binding sites between compounds and proteins.
We employed different types of neural network layers to build the CPI prediction model, recruiting basic information from compounds and proteins as input. The embeddings of compounds and proteins are learned merely from CPI data during the training process. However, a large number of small molecules and protein sequences are available in public databases and can be exploited to learn compound and protein embeddings through unsupervised approaches. Pre-trained embeddings have the potential to make learning significantly easier and cheaper. In addition, attention mechanisms have proven their efficiency in several deep learning models, and integrating a relevant attention mechanism into our CPI prediction framework may further improve its performance. We leave these as open avenues for exploration in future work. The visual explanation we provide on the protein sequences suggests possible binding sites, which offers opportunities to generate new biological insights into protein function and opens applications in the context of drug discovery and re-purposing. As the incorporation of compound substructure information yields good CPI prediction performance, we argue that substructure analysis can be a valuable step in computational drug design.